A step back

About Replication…

  • Imagine we try to replicate a “significant” study from the psychological literature…
  • What would be the probability of getting a result similar to the original?
  • And in a perfect world, what should that probability be?

The Reality

Open Science Collaboration (2015):

  • Original Studies: 97% significant

  • Replications: Only 36% significant

  • Subjective Replication Rate: 39%

  • Effect Magnitude: Halved


So…Overall Replication Success was less than 50%!

Replication Crisis

The systematic failure to replicate or reproduce original findings.


An Important Distinction

  • Replicability
    Same Design + New Data \(\rightarrow\) Similar Results
  • Reproducibility
    Same Data + Same Analysis \(\rightarrow\) Identical Results

The Wider Credibility Crisis


Theory Crisis

(Oberauer & Lewandowsky, 2019)

  • Vague and flexible theories

  • Weak theory-hypothesis links

  • Data fails to kill bad theories

Validity Crisis

(Schimmack, 2021)

  • Unknown validity of instruments

  • Too many measures


  • Garbage In, Garbage Out

Replication Crisis

Causes:

  • Questionable Research Practices (QRPs, John et al., 2012)

  • Publication bias

  • “Publish or perish!” (Callard, 2022)

  • Researcher degrees of freedom (Simmons et al., 2011)

Inflation of false positives

Research Question: Is CBT effective for treating Depression?

  • Data Collection:
    • Measures: Self-report (BDI-II) or Clinician-rated (HAM-D)?
    • Outcome: Symptom reduction, Remission rate, or Functionality?
  • Data Processing:
    • Exclusion: Removing Outliers (> 1.5 SD or > 2.5 SD?)
    • Cleaning: Managing Missing data (Imputation vs. Deletion?)
  • Data Analysis:
    • Covariates: Baseline severity, Medication, Gender, SES?
    • Model: ANCOVA, Mixed Models (LMM), Generalized Linear Model (GLM)?

Researcher degrees of freedom: The ambiguous space of equally justifiable choices a researcher makes during the research process, which can drastically alter the final results. (Simmons et al., 2011)

Research Question: Is CBT effective for treating Depression?

  • So: we choose a path and run our analysis \(\rightarrow\) No significant result 😞

  • But we need this publication! (Only significant results get published)

  • What do we do?

  • Let’s try another path (e.g., remove outliers)!

  • And another one (change covariates)…

  • And another one… Until we get a Significant Result (\(p < .05\))!

The Problem:

By trying multiple paths, we drastically inflated the False Positive Rate

\(\rightarrow\) Multiple Comparisons
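This inflation is easy to see in a small simulation. A minimal sketch in Python (the four analysis paths are hypothetical, not from the CBT example): data with no true effect are analyzed several "equally justifiable" ways, and a study counts as positive if any path reaches p < .05.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(7)
n_sim, n = 5000, 40

def p_two_sided(z):
    """Two-sided p-value of a z statistic (known unit variance)."""
    return erfc(abs(z) / sqrt(2))

false_pos = 0
for _ in range(n_sim):
    x = rng.normal(size=n)           # null data: the true effect is zero
    paths = [
        x,                           # path 1: full sample
        x[np.abs(x) < 2.5],          # path 2: drop "outliers" beyond 2.5 SD
        x[np.abs(x) < 1.5],          # path 3: stricter outlier rule
        x[:30],                      # path 4: a smaller "cleaned" subset
    ]
    # same z formula on every path for simplicity (the trimmed paths are
    # slightly conservative); a win = ANY path gives p < .05
    ps = [p_two_sided(d.mean() * sqrt(len(d))) for d in paths]
    false_pos += min(ps) < 0.05

fpr = false_pos / n_sim
print(f"False-positive rate after trying 4 paths: {fpr:.3f} (nominal .05)")
```

Even though the paths are highly correlated (same data), the false-positive rate clearly exceeds the nominal 5%.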

The Multiple Comparisons Problem


Coin Toss Example:

  • What is the probability of getting at least one Head?
    • One Toss: \(P(\text{Head}) = 0.50\)
    • Ten Tosses: \(P(\ge 1 \text{ Head}) = 1 - (1 - 0.5)^{10} \approx 0.999\)

The probability of getting what we want approaches 100%!

Psychology’s “Coin Toss”

  • In Psychology, our coin has a 5% probability of landing Heads (\(\alpha = .05\))
  • Then we can keep tossing this coin until we win
  • Ten hypothesis tests: \(P(\geq 1 \text{ significant}) = 1 - 0.95^{10} \approx 0.40\)
  • We eventually get our significant \(p < .05^*\)

The problem is that Head = False Positives…
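Both the coin example and the testing example come from the same family-wise error formula; a two-line check in Python:

```python
# P(at least one "hit" in k independent tries at per-try probability p)
def at_least_one(p, k):
    return 1 - (1 - p) ** k

print(f"Fair coin, 10 tosses:   {at_least_one(0.50, 10):.3f}")  # 0.999
print(f"alpha = .05, 10 tests:  {at_least_one(0.05, 10):.3f}")  # 0.401
```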

Replication Crisis

Proposed solutions:

  • Open science

  • Pre-registration

  • Registered Reports

  • Multiverse Analysis (Steegen et al., 2016)

Multiverse Analysis



Multiverse of analytical scenarios
(data collection, coding and analysis)

  → Analysis and presentation of results from every plausible scenario

    → Descriptive methods

    → Inferential methods

Descriptive Multiverse Analysis

Limitations:

  • No Control for Multiple Comparisons
    • Without a formal correction, we can’t test the “significance” of the Multiverse
  • Risk of High Type I Error \(\rightarrow\) False Positives
    • Merely observing p-values does not correct for the probability of chance results
  • No Statistical Inference
    • Conclusions limited to the specific dataset \(\rightarrow\) cannot generalize to the population

Inferential Multiverse Analysis

PIMA:

  • Formal P-value Adjustment for Multiple Comparisons
    • Mathematically corrects for the Garden of forking paths (numerous tests)
  • Strong Control of Type I Error
    • Keeps rate of false positives \(\leq 5\%\)
  • Good Statistical Power
  • Enables Selective Inference
    • Allows us to safely generalize specific significant results to the population

Your Meta-Analyses

Task: “Is psychotherapy effective for treating depression?”

Dataset

study                 es_id    yi    vi  target_group  format            diagnosis               type       control    region        risk_of_bias
aagaard, 2017             1  0.00  0.05  adults        group             diagnosis               cbt-based  cau        europe        some concern
allart van dam, 2003     11  0.57  0.04  adults        group             subclinical depression  cbt-based  cau        europe        some concern
andersson, 2005          17  0.95  0.05  adults        guided self-help  cut-off score           cbt-based  other ctr  europe        some concern
andersson, 2005          18  0.79  0.05  adults        guided self-help  cut-off score           cbt-based  other ctr  europe        some concern
arjadi, 2018             31  0.39  0.01  adults        guided self-help  diagnosis               cbt-based  other ctr  other region  low
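As a toy illustration of how such effect sizes (yi) and variances (vi) are pooled, here is a minimal inverse-variance (fixed-effect) estimate in Python using only the five rows shown above. The analysts' actual results come from full meta-analytic models on the complete dataset, so their numbers differ.

```python
import numpy as np

# First five effect sizes (yi) and variances (vi) from the dataset above
yi = np.array([0.00, 0.57, 0.95, 0.79, 0.39])
vi = np.array([0.05, 0.04, 0.05, 0.05, 0.01])

# Inverse-variance (fixed-effect) pooled estimate and its standard error
w = 1 / vi
est = np.sum(w * yi) / np.sum(w)
se = np.sqrt(1 / np.sum(w))

print(f"Pooled g = {est:.2f}, 95% CI [{est - 1.96*se:.2f}, {est + 1.96*se:.2f}]")
```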

Results

Task: “Is psychotherapy effective for treating depression?”


Analyst               Effect Size (g)  95% CI        Significance
Felix 1               0.72             [0.61, 0.83]  Significant
Felix 2               0.77             [0.53, 1.02]  Significant
Kevin                 0.72             [0.61, 0.83]  Significant
Julia                 0.72             [0.61, 0.83]  Significant
Isabella              0.72             [0.61, 0.83]  Significant
Isabella No Bias      0.47             [0.37, 0.57]  Significant
Isabella No Outliers  0.65             [0.61, 0.69]  Significant
Tobias                0.71             [0.63, 0.79]  Significant
Tobias No Outlier     0.64             [0.59, 0.69]  Significant
Tobias No Bias        0.48             [0.42, 0.54]  Significant

So…is psychotherapy effective for treating depression?

Conclusion: Yes, but the estimated effect depends on the choices we make!

Where can meta-analytical paths diverge?

  • The Model: Fixed-effect vs Random-effects
  • Data Analysis: Outlier exclusion, Bias control, Handling multiple outcomes…
  • The Metric: Cohen’s d vs Hedges’ g

So… Which result is the “correct” one?
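On the last divergence point: Cohen's d and Hedges' g differ only by a small-sample correction factor. A quick sketch using the standard approximation \(J \approx 1 - 3/(4\,df - 1)\); the sample sizes below are hypothetical.

```python
def hedges_g(d, n1, n2):
    """Hedges' g: Cohen's d times the small-sample correction factor
    J ~ 1 - 3 / (4*df - 1), with df = n1 + n2 - 2."""
    df = n1 + n2 - 2
    return d * (1 - 3 / (4 * df - 1))

# With small groups the correction is visible; with large ones, negligible
print(hedges_g(0.72, 20, 20))      # slightly smaller than d = 0.72
print(hedges_g(0.72, 1000, 1000))  # essentially equal to d
```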

Multiverse Meta-Analysis

Case Study - Results

Summary effects (k = 1144 meta-analyses):

  • Median = 0.59
  • Mean \(\bar{x}\) = 0.63
  • Min-Max = [0.28, 1.61]
  • Clinical significance threshold \(\geq\) 0.24 (Cuijpers et al., 2014)

Significant meta-analyses after p-value adjustment (maxT):

  • Never significant: 8 (0.7%)
  • Significant before correction: 1136 (99.3%)
  • Significant after correction: 1030 (90.0%)

Implications

  • Inferential Multiverse Meta-Analysis

    → to enhance transparency and robustness

  • Addressing selective reporting and p-hacking
  • Relative stability of findings on the effectiveness of psychotherapies for depression

Limitations

  • Simplification of dataset and analyses
  • No multilevel meta-analyses
  • No quantitative assessment of publication bias

Future directions

  • PIMMA to consolidate knowledge and evidence in psychology
  • Extend the method to multilevel and/or multivariate meta-analyses
  • R package on the way!

Future directions

  • Book: Multiverse Analysis in R – Exploratory and Inferential Approaches (forthcoming 2026)
  • Authors: Altoè, G., Gambarota, F., Girardi, P., Vesely, A., Calignano, G., Manente, M., Pastore, M., Finos, L.

Multiverse Meta-Analysis
How?

Step 1. Creating the Multiverse

Scenario        Model  Therapy  Format      Bias  Diagnosis
\(m_1\)         EE     CBT      Individual  High  Clinical
\(m_2\)         RE     Non-CBT  Group       Low   Cut-off
…
\(m_{1920}\)    RE     All      All         All   All


  • Compute Meta-Analysis for each scenario (\(m_i\))
  • Include Meta-Analyses with at least 10 studies (\(k \geq 10\))
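Step 1 amounts to crossing the specification options. A minimal sketch in Python: the option levels below are illustrative and fewer than the real grid, which produced 1920 scenarios before the \(k \geq 10\) filter.

```python
from itertools import product

# Hypothetical specification grid mirroring the table above
specs = {
    "model":     ["EE", "RE"],
    "therapy":   ["CBT", "Non-CBT", "All"],
    "format":    ["Individual", "Group", "All"],
    "bias":      ["High", "Low", "All"],
    "diagnosis": ["Clinical", "Cut-off", "All"],
}

# One dict per scenario: every combination of one level per factor
multiverse = [dict(zip(specs, combo)) for combo in product(*specs.values())]
print(len(multiverse))  # 2 * 3 * 3 * 3 * 3 = 162 scenarios in this sketch
```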

Step 2. Score calculation

  • Compute score \(z_k\) for every study k in each scenario \(m_i\)

\[ z_k = \frac{y_k}{v_k + \tau_0^2} \]

\(y_k\) = effect size estimate from study k

\(v_k\) = variance of study k

\(\tau_0^2\) = between-study variance under the null (\(H_0\))

Step 3. Score Matrix

  • Store the scores \(z_k\) in a \(k \times m\) matrix

    → Rows = primary-study scores (\(z_k\))

    → Columns = scenarios/meta-analyses (\(m_i\))

             \(m_1\)    \(m_2\)    …    \(m_{1144}\)
\(k_1\)       0.34      0.28     …      0.00
\(k_2\)      -0.25      0.00     …     -0.25
  ⋮
\(k_{124}\)   0.00      0.52     …      0.48

Zero entries (0.00) indicate that study \(k\) was not included in meta-analysis \(m_i\)

Step 4. Permutation-based Inference

  • Sign-flipping score test (see Girardi et al., 2024)

    1. Randomly multiply each row \(k_i\) by +1 or -1 (sign-flipping)
      \(\rightarrow\) equivalent to re-sampling under the null hypothesis
    2. Compute the test statistic (sum of scores) for each permuted scenario \(m_i\)
    3. Repeat B times \(\rightarrow\) null distribution of each scenario \(m_i\)
    4. Compare observed scores vs permuted null distribution
      \(\rightarrow\) raw & adjusted p-values (maxT)
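The four steps above can be sketched in Python. This is a toy, randomly generated score matrix, not the case-study data; the real method is the sign-flipping score test of Girardi et al. (2024), with scores \(z_k = y_k / (v_k + \tau_0^2)\).

```python
import numpy as np

rng = np.random.default_rng(123)

# Toy score matrix Z (k studies x m scenarios); zeros mark studies
# excluded from a scenario, as in the score matrix above
k, m, B = 50, 8, 2000
Z = rng.normal(0.3, 1.0, size=(k, m))   # shared signal -> correlated scenarios
Z[rng.random((k, m)) < 0.2] = 0.0       # ~20% exclusions

t_obs = Z.sum(axis=0)                   # observed test statistic per scenario

# Sign-flipping: one +1/-1 flip per STUDY (row), applied to every
# scenario, which preserves the correlation between scenarios
t_perm = np.empty((B, m))
t_perm[0] = t_obs                       # the identity flip is included
for b in range(1, B):
    flips = rng.choice([-1.0, 1.0], size=(k, 1))
    t_perm[b] = (flips * Z).sum(axis=0)

p_raw = (np.abs(t_perm) >= np.abs(t_obs)).mean(axis=0)

# maxT: compare each observed statistic with the permutation
# distribution of the MAXIMUM absolute statistic across scenarios
max_null = np.abs(t_perm).max(axis=1)
p_adj = np.array([(max_null >= abs(t)).mean() for t in t_obs])

print("raw:", p_raw.round(3))
print("adj:", p_adj.round(3))
```

By construction the adjusted p-values are never smaller than the raw ones, and the adjustment accounts for the shared rows across scenarios.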

Takeaways

Meta-Analysis

What is the right way to perform a research synthesis? There is no single right way. It all depends on the purpose of the synthesis.


(Borenstein, 2009)

Multiverse Analysis

  • Be thoughtful

    → Include only equally defensible choices (Del Giudice & Gangestad, 2021)

  • Be parsimonious

    → Include only well-justified models for statistical power

  • Be exhaustive

    → Account for all relevant variables to avoid inflated false positive rates

Methodological Guiding Principles

  • Thinking before testing
  • Focus on Effect Sizes, NOT just p-values
  • Open Science & Transparency

Keep in Touch: The Psicostat Group

  • An interdisciplinary research group: Psychology & Statistics
  • Curiosity over certainty: open questions, humility, and learning from mistakes
  • Join the conversation: Bi-weekly open meetings online
    • Every two Fridays at 12:00 PM CET via Zoom

Scan to join the mailing list!

Extra

Bonferroni Correction

  • Easy fix for keeping the false-positive rate \(< 5\%\)

\[ \alpha_{adj} = \frac{\alpha_{standard}}{\text{number of tries (k)}} \]

  • Example: if we do ten different tests on different data

\[ \alpha_{adj} = \frac{0.05}{10} = 0.005\]

Now we are safe from false positives

but it’s incredibly hard to find true effects!
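That power cost is easy to quantify for a two-sided z-test, using only Python's standard library. The noncentrality \(\delta = 2.8\) below is a hypothetical true effect, chosen to give roughly 80% power at \(\alpha = .05\).

```python
from statistics import NormalDist

N = NormalDist()
alpha, k, delta = 0.05, 10, 2.8   # delta: noncentrality of a true effect

def power(a):
    """Power of a two-sided z-test at level a for noncentrality delta."""
    z = N.inv_cdf(1 - a / 2)
    return N.cdf(-z - delta) + 1 - N.cdf(z - delta)

print(f"Power at alpha = .05:           {power(alpha):.2f}")  # ~0.80
print(f"Power at Bonferroni .05 / 10:   {power(alpha / k):.2f}")  # ~0.50
```

A true effect we would detect 80% of the time at \(\alpha = .05\) is detected only about half the time after correcting for ten tries.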

The Multiverse Reality

  • Bonferroni assumes tests are independent (like coin tosses)
  • But Multiverse specifications are highly correlated:
    \(\rightarrow\) similar tests on the same data
  • Bonferroni ignores the correlations
    \(\rightarrow\) Massive Power Loss (High Type II Error)
  • maxT adjustment:
    1. empirically models the correlation structure via permutations
    2. corrects for multiplicity without killing statistical power

Bonferroni vs maxT